
6.4. Recent progress in image recognition
The second hidden layer is also a convolutional layer, with a max-pooling step. It uses 5×5 local receptive fields, and there's a total of 256 feature maps, split into 128 on each GPU. Note that the feature maps only use 48 input channels, not the full 96 output from the previous layer (as would usually be the case). This is because any single feature map only uses inputs from the same GPU. In this sense the network departs from the convolutional architecture we described earlier in the chapter, though obviously the basic idea is still the same.
The third, fourth and fifth hidden layers are convolutional layers, but unlike the previous layers, they do not involve max-pooling. Their respective parameters are: (3) 384 feature maps, with 3×3 local receptive fields, and 256 input channels; (4) 384 feature maps, with 3×3 local receptive fields, and 192 input channels; and (5) 256 feature maps, with 3×3 local receptive fields, and 192 input channels. Note that the third layer involves some inter-GPU communication (as depicted in the figure) in order that the feature maps use all 256 input channels.
The sixth and seventh hidden layers are fully-connected layers, with 4,096 neurons in
each layer.
The output layer is a 1,000-unit softmax layer.
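To make the layer-by-layer bookkeeping above concrete, here is a small sketch in plain Python of the standard formula for the spatial size of a convolutional or pooling layer's output, floor((n + 2p - k)/s) + 1. The padding and stride values used below are illustrative assumptions; the text above does not state them.

```python
def conv_out(n, k, stride=1, pad=0):
    """Spatial output size of a convolutional (or pooling) layer applied
    to an n x n input with k x k local receptive fields:
    floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

# A 3x3 receptive field with padding 1 and stride 1 preserves spatial
# size (plausible for the third, fourth and fifth layers, assuming
# padding 1, which the text does not specify):
assert conv_out(13, 3, stride=1, pad=1) == 13

# A 3x3 max-pooling step with stride 2 roughly halves the spatial size:
assert conv_out(27, 3, stride=2, pad=0) == 13
```

The same formula explains why feature-map counts can grow (96, 256, 384, ...) while the spatial dimensions shrink: pooling steps reduce n, while the number of maps is chosen independently.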
The KSH network takes advantage of many techniques. Instead of using the sigmoid or
tanh activation functions, KSH use rectified linear units, which sped up training significantly.
KSH’s network had roughly 60 million learned parameters, and was thus, even with the large
training set, susceptible to overfitting. To overcome this, they expanded the training set using
the random cropping strategy we discussed above. They also further addressed overfitting
by using a variant of l2 regularization, and dropout. The network itself was trained using
momentum-based mini-batch stochastic gradient descent.
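As a rough sketch of what one step of that training procedure looks like, here is momentum-based SGD with an L2 weight-decay term, plus dropout, in plain Python. The hyperparameter values are illustrative assumptions, not the ones KSH used, and the "inverted" dropout shown (rescaling at training time) is a common modern variant of the dropout described in the paper.

```python
import random

def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9, l2=0.0005):
    """One momentum-based SGD update on a single weight w.
    grad is the gradient estimated from one mini-batch; the l2 term
    implements weight decay (a variant of L2 regularization)."""
    v = mu * v - lr * (grad + l2 * w)   # velocity accumulates past gradients
    return w + v, v

def dropout(activations, p=0.5, rng=random):
    """Inverted dropout: zero each activation with probability p and
    rescale the survivors, so expected activations match at test time."""
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

# One illustrative update (weight decay switched off for clarity):
w, v = sgd_momentum_step(1.0, 0.0, grad=0.1, lr=0.1, mu=0.9, l2=0.0)
```

In a real network the same update is applied to all of the (roughly 60 million, in KSH's case) parameters at once, with the gradients computed by backpropagation.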
That’s an overview of many of the core ideas in the KSH paper. I’ve omitted some
details, for which you should look at the paper. You can also look at Alex Krizhevsky’s
cuda-convnet (and successors), which contains code implementing many of the ideas. A
Theano-based implementation has also been developed[27], with the code available here. The
code is recognizably along similar lines to that developed in this chapter, although the use of
multiple GPUs complicates things somewhat. The Caffe neural nets framework also includes
a version of the KSH network, see their Model Zoo for details.
The 2014 ILSVRC competition:
Since 2012, rapid progress continues to be made.
Consider the 2014 ILSVRC competition. As in 2012, it involved a training set of 1.2 million
images, in 1,000 categories, and the figure of merit was whether the top 5 predictions
included the correct category. The winning team, based primarily at Google[28], used a deep
convolutional network with 22 layers of neurons. They called their network GoogLeNet,
as a homage to LeNet-5. GoogLeNet achieved a top-5 accuracy of 93.33 percent, a giant
improvement over the 2013 winner (Clarifai, with 88.3 percent), and the 2012 winner (KSH,
with 84.7 percent).
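The top-5 figure of merit itself is simple to compute: a prediction counts as correct if the true category is among the model's five highest-scoring categories. Here is a minimal sketch in plain Python (the class scores below are made-up numbers, not real network outputs):

```python
def top5_correct(scores, true_label):
    """True if true_label is among the five highest-scoring class indices."""
    top5 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
    return true_label in top5

def top5_accuracy(score_lists, labels):
    """Fraction of examples whose true label lands in the top five."""
    hits = sum(top5_correct(s, y) for s, y in zip(score_lists, labels))
    return hits / len(labels)
```

GoogLeNet's 93.33 percent top-5 accuracy is equivalent to a top-5 error of 6.67 percent, which is how results are often quoted in the competition reports.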
Just how good is GoogLeNet’s 93.33 percent accuracy? In 2014 a team of researchers
wrote a survey paper about the ILSVRC competition[29]. One of the questions they address is
how well humans perform on ILSVRC. To do this, they built a system which lets humans
[27] Theano-based large-scale visual recognition with multiple GPUs, by Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham Taylor (2014).
[28] Going deeper with convolutions, by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2014).
[29] ImageNet large scale visual recognition challenge, by Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei (2014).